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ABSTRACT It is a challenging task to infer selection intensity and allele age from population genetic data. 
Here we present a method that can efficiently estimate selection intensity and allele age from the multilocus 
haplotype structure in the vicinity of a segregating mutant under positive selection. We use a structured- 
coalescent approach to model the effect of directional selection on the gene genealogies of neutral 
markers linked to the selected mutant. The frequency trajectory of the selected allele follows the Wright- 
Fisher model. Given the position of the selected mutant, we propose a simplified multilocus haplotype 
model that can efficiently model the dynamics of the ancestral haplotypes under the joint influence of 
selection and recombination. This model approximates the ancestral genealogies of the sample, which 
reduces the number of states from an exponential function of the number of single-nucleotide poly- 
morphism loci to a quadratic function. That allows parameter inference from data covering DNA regions as 
large as several hundred kilo-bases. Importance sampling algorithms are adopted to evaluate the 
probability of a sample by exploring the space of both allele frequency trajectories of the selected 
mutation and gene genealogies of the linked sites. We demonstrate by simulation that the method can 
accurately estimate selection intensity for moderate and strong positive selection. We apply the method to 
a data set of the G6PD gene in an African population and obtain an estimate of 0.0456 (95% confidence 
interval 0.0144—0.0769) for the selection intensity. The proposed method is novel in jointly modeling the 
multilocus haplotype pattern caused by recombination and mutation, allowing the analysis of haplotype 
data in recombining regions. Moreover, the method is applicable to data from populations under expo- 
nential growth and a variety of other demographic histories. 
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INTRODUCTION 

There is an increased interest in elucidating the role of natural selection 
in the evolution of human and other species using population genetic 
data. Evidence shows that selection has been actively shaping the 
genetic diversity of human populations during the process of adapta- 
tion to new environments and infectious diseases (Sabeti et al. 2002; 
Bersaglieri et al 2004; Tishkoff et al. 2007; Simonson et al. 2010; Yi 
et al. 2010; Peng et al. 2011; Xu et al. 2011; Kamberov et al. 2013). 
Selection in human populations can leave "footprints" in patterns of 
single-nucleotide polymorphisms (SNPs) in the vicinity of the selected 
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mutant. Numerous methods have been developed to detect natural 
selection based on such polymorphism patterns (Tajima 1989; Fu and 
Li 1993; Fay and Wu 2000; Kim and Stephan 2002; Sabeti et al. 2002; 
Nielsen et al. 2005; Voight et al. 2006; Sabeti et al. 2007; Tang et al. 
2007; Chen et al. 2010). However, only a few methods are available for 
inferring quantities of the selective process, such as selection intensity 
and allele age. Among the existing methods, some consider single 
markers linked to the selected locus (e.g., Slatkin 2001; Kim and 
Stephan 2002), whereas more sophisticated methods gain information 
by exploiting the haplotype structure of multiple marker loci. For 
example, Coop and Griffiths (2004) inferred selection intensity and 
allele age by analyzing mutations among different haplotypes along 
their genealogical history. Coop and Griffiths (2004) used an impor- 
tance sampling algorithm to explore possible gene genealogies. Re- 
combination is not allowed by their method, and thus it works only 
for nonrecombining regions. Rannala and Reeve (2004) extended their 
former likelihood approach for disease mapping (Rannala and Reeve 
2001) to estimate allele age of a mutant under neutrality using mul- 
tiple linked markers, and employed Markov Chain Monte Carlo to 
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generate the posterior distribution of allele age of the mutant. Slatkin 
(2008) presented a Bayesian method for jointly inferring selection 
intensity and allele age of the mutant using linkage disequilibria of 
multiple marker loci and generated the probability of data with an 
importance sampling algorithm. 

The aforementioned multilocus methods all require modeling the 
effect of selection on the genealogical structure of neutral markers under 
a coalescent framework, which can be done in two ways. The first 
approach is to use the Krone-Neuhauser ancestral selection graph (ASG; 
Krone and Neuhauser 1997). In the ASG, the genealogy of the selected 
allele is embedded in a branching-coalescing graph so that both selection 
and mutation can be incorporated in the graph. The ASG approach is 
useful for simulating genealogies under weak selection. For moderate or 
strong selection, the ASG becomes so large that the computation 
becomes intractable. The ASG method's performance was dramatically 
improved by truncating the ASG (Slade 2000) to avoid generating very 
large ASGs. This approach, however, has not been extended to the 
analysis of multiple linked neutral mutations. The second approach is 
the structured coalescent (Kaplan et al 1988; Hudson and Kaplan 1988), 
which generates historical frequency trajectories of the selected allele and 
then treats chromosomes carrying the mutant allele and nonmutant 
allele as subpopulations between which there exists "gene flow" caused 
by recombination. For alleles under balancing selection, the allele fre- 
quencies were assumed to be constant. For positively selected alleles, the 
allele frequency trajectories can either be generated by stochastic simu- 
lation or be approximated using deterministic equations. 

The aforementioned multilocus methods all adopted the struc- 
tured-coalescent model of selection to estimate the selection param- 
eters (Coop and Griffiths 2004; Slatkin 2008), which is also the model 
used in our proposed method. But the approach to sampling the allele 
frequency trajectory from its probability distribution in our method 
differs from these others. Coop and Griffiths (2004) generated random 
trajectories of selected mutations under the Moran model, which has 
the property of time reversibility under mutation and additive selec- 
tion (Watterson 1975) but works only for populations of constant size. 
Rannala and Reeve (2004) made an assumption that the historical 
allele frequencies of neutral markers are constant during the whole 
process, which may not hold for real populations, especially for 
markers under the hitch-hiking effect. Slatkin (2008) used a linear 
birth-and-death process to approximate the genealogical trees of hap- 
lotypes carrying selected mutants, which is an adequate approxima- 
tion for mutants in low frequency, but may not be suitable for 
common mutants. We use the Wright-Fisher model instead of the 
Moran model to generate allele frequency trajectories, and apply the 
importance sampling scheme in Slatkin (2001) to weight the trajecto- 
ries when estimating selection parameters. This allows us to model the 
selective sweeps in a population with time-varying size, and allows us 
to analyze both high- and low-frequency alleles under selection. 

One advantage of the proposed method over the existing methods 
is that we model the dynamics of ancestral haplotypes under the joint 
effects of selection, mutation and recombination during a selective 
sweep process. In contrast, Coop and Griffiths (2004) assumed no 
recombination in their model, whereas both Slatkin (2008) and Ran- 
nala and Reeve (2004) simplified the transitions among different 
multi-loci haplotypes induced by recombination. In particular, some 
of the recombination events between different types of haplotypes 
within the selected haplotype group were ignored. This restriction 
can cause significant bias when the selected allele is in medium or 
high frequency in the population. Our method explicitly describes the 
frequencies of different selected haplotypes over time during the se- 
lective process. Therefore, our proposed method applies to both high- 



and low-frequency mutants. Our method also requires some approx- 
imations to improve computational efficiency. 

As we will demonstrate in the section A simplified multi-locus 
model for haplotype structure, the model efficiently reduces the state 
space of ancestral haplotypes from an exponential function of the 
number of SNP loci to a quadratic function, and thus allows the in- 
ference of allele age and selection intensity from multi-SNP haplotypes 
spanning several hundred kilo-bases or even mega-bases affected by 
strong selections. We then modify the importance sampling method of 
Griffiths and Tavare (1994b) to obtain the probability of a sample 
configuration and to estimate the selection parameters by averaging 
over genealogies of the linked sites. Note that an alternative choice is to 
adopt the existing importance sampling algorithms developed for mul- 
tilocus ancestral recombination graph (ARG) under neutrality (Griffiths 
and Marjoram 1996; Fearnhead and Donnelly 2001) and incorporate 
the ARG into the structured-coalescent model. However, because of the 
large state space of genealogies that has to be explored by the impor- 
tance sampling algorithms for a multilocus ARG, this approach is in- 
tractable on a genomic scale of hundreds of kilobases. 

OVERVIEW OF THE METHOD 

Suppose the data consist of « samp ie haplotypes with known phase 
collected from the current population. If genotype data are collected, 
the phase can be estimated by available algorithms (Scheet and Ste- 
phens 2006). The haplotypes are divided into two groups: the selected 
haplotypes, which are the chromosomes carrying the selected allele, 
and the background haplotypes, which do not carry the selected allele. 
In this method, we view the coalescent process in the n selected 
haplotypes as in a structured subpopulation (see Figure 1). The n 
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Figure 1 An illustration of the structured-coalescent approach for 
modeling positive selection. The historical population sizes are indicated 
by the distance between the two dashed lines; and the allele frequency 
trajectory of the selected allele is indicated with a thin solid curve. The 
coalescent history of the selected locus with five derived lineages (solid 
bold lines) and five ancestral lineages (dotted bold lines) is super- 
imposed on the trajectory and population size curves. The present time, 
t = 0, is at the bottom. And the time at which the trajectory of the 
selected mutant merged to the population-size curve denotes the time 
when the selected mutant arose in the population, i.e., the allele age T. 
In the model presented in the main text, only the sub-genealogies in the 
selected allele groups (bold solid lines) are considered. 
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selected haplotypes are represented by a n by m matrix with "1" or "0" 
for each entry, ZV,-, corresponding to the allele type of the !th haplo- 
type at the jth SNP position. The position of the selected mutant is 
assumed to be known, as is the genetic distance between the ^th and 
j 2 thSNP, {r m , 1 < j b j 2 < m}. 

We model the effect of selection with the structured- coalescent 
scheme (see Figure 1). The likelihood function of the observed selected 
haplotype data, 23, can be computed from 

c(s) = p(23[s, r) = jfv(v\g)p(g\n)p(H\s, T)dgdn, (i) 

where Q denotes the genealogy and 7i the frequency trajectory of the 
selected mutant. Neither H nor Q is observed directly. When construct- 
ing the likelihood function, they are often integrated out (Felsenstein 
1988; Griffiths and Tavare, 1994b; Kuhner et al 1995). 

The frequency trajectory of the selected mutant, TL, is a random 
process that follows the Wright-Fisher model and has a probability 
distribution, P(H\s, F), that depends on the selection intensity s and 
the nuisance parameter set T, which includes all the other parame- 
ters related to the population history. Conditional on any given fre- 
quency trajectory, the sampling probability of the data is constructed 
by summing over all possible genealogical events: f'(T>\Ti) = 

jv{v\g)F{g\H)dg. 

For computationally efficient evaluation of the sampling proba- 
bility, we propose a novel simplified multilocus model for the 
transition of different types of selected haplotypes (see the section A 
simplified multilocus model for haplotype structure). We compute the 
extent of the ancestral haplotypes in the vicinity of the selected 
mutant under the combined effects of recombination and selection. 
Together with the infinitely-many-sites model for mutations, the 
simplified multilocus haplotype model is used to approximate the 
evolutionary dynamics of the data. 

Because the spaces of both gene genealogies and allele frequency 
trajectories are too large to explore, V(g\H) and F(7i|s, T) cannot be 
expressed in closed forms. We use the importance sampling algo- 
rithms to sample genealogies and trajectories that are compatible with 
the data from the proposal distributions. Then the likelihood is esti- 
mated as the weighted average probability for the samples. The im- 
portance weights are obtained by taking the ratio of the probabilities 
of true distribution and the proposal distribution. The procedure of 
evaluating the likelihood is illustrated by the flowchart in Figure 2. 
Two main steps of the flowchart correspond to sampling Tl and g 
using the importance sampling algorithms. The details of the algo- 
rithms and calculation of the importance ratios are presented in the 
section Importance sampling and proposal distributions. 

A SIMPLIFIED MULTILOCUS MODEL FOR HAPLOTYPE 
STRUCTURE 

We model the transition of the selected haplotypes (haplotypes 
carrying the selected mutant) under the influence of evolutionary 
events including recombination and mutation. We start with a sample 
of selected haplotypes collected from the current generation. When 
looking backward in time, we can eventually trace these selected 
haplotypes to one common ancestor (the ancestral haplotype) because 
all copies of the selected allele are descended from a single mutation. 
During the selection process, recombination breaks up and mixes the 
fragments with the background haplotypes. Recombination combined 
with mutation generates the different selected haplotypes that contain 
some segments of the ancestral haplotype. The number of distinct 
selected haplotypes at different times in the history is called the 
ancestral process. This ancestral process, conditional on the frequency 



trajectory of the selected mutant, can be viewed as a structured- 
coalescent process, because the selected haplotypes evolve as a sub- 
population of the entire haplotype pool, with the size of the 
subpopulation determined by the mutant allele frequency, and the 
transitions among different haplotypes following the simplified 
multilocus model. 

To illustrate the state space of the ancestral process and the joint 
effect of selection and recombination on the transitions between the 
ancestral states, we will start with a simple case of only two loci, the 
selected locus and a partially linked SNP locus. We then extend the 
two-locus model to a simplified multi-locus haplotype model, after 
making several approximations for computational efficiency. Then in 
the section Sampling probability of a multi-locus haplotype configura- 
tion, the simplified multilocus haplotype model and the infinitely- 
many-sites model for mutation are used to derive the sampling prob- 
ability for haplotype configuration of a sample by summing over 
possible ancestral states of the genealogical history, that is, the 
^ G P(23|C/)P(C/|7Y) component of likelihood function conditional 
on a simulated allele frequency trajectory. 

A two-locus model 

The two-locus haplotype model involves only the selected locus and 
one neutral marker, the positions of which are assumed known. The 
selected locus has the mutant allele A and the other neutral allele a. 
The neutral marker locus has two alleles B and b. Let Q(f) = (q lt q 2 ; <j 3 , 
q 4 ) denote the number of haplotypes AB, Ab, aB, and ab in the sample 
at time t. Conditional on the ancestral allele frequency trajectory, {X t , 
t > 0}, the "ancestral process" Q(t), which is defined as the numbers of 
each haplotype, can be approximated by the inhomogeneous Markov 
process (Hudson and Kaplan 1988; Durrett and Schweinsberg 2004). 
The states that the process Q(t) can jump from state (q lt q 2 ; q 3 , q 4 ) to 
include {q 1 - 1, q 2 ; q 3 , q 4 ), (^ - 1, q 2 + 1; q 3 , q 4 ), {q x - 1, q 2 ; q 3 + 1, 
qi), (<?i ~ l. <£>; <?3. <?4 + l). (<?i> <?2 ~ l; <?3> qd, (<?i + 1. <?2 - 1; <j3> qd, 
(<?i> <?2 - i; <?3 + i. qd, (<2i> <?2 - i; <?3> <24 + i). (<?i. <?3 - i. qd, (qi + 
i. q-i, <j3 - i. qd, (<ji> <?2 + i; <j3 - l. qd, (<?i> qi, - U qt + l). (<?i> qi, 
qi, q4 - i). (<?i + i. qi, qi, q^ - 1). (qi, qi + i; q?,, q* - i) and {q u q 2 , 

q 3 + 1, q 4 — 1). The transition probabilities from (q lt q 2 ; q 3 , q 4 ) to the 
first four states are listed in Table 2, and the other transition proba- 
bilities can be constructed similarly. We assume an infinitely-many- 
sites model for mutations, so there are no new or recurrent mutations 
between the two alleles of either the selected locus or the neutral 
marker locus. Let N AB (f), N A y(f), N aB (t), and N a y{t) be the population 
counts of the four corresponding haplotypes AB, Ab, aB and ab at 
time t respectively. For the transition from {q Y , q 2 ; q } , q 4 ) to (qi — 1, 
q-i, qi, <?4)> no recombination has occurred and two lineages of 
haplotype AB are chosen to coalesce. The coalescence rate is 
qi{\ — r) x a u\ i for the transition from (q ly q 2 ; q 3 , q 4 ) to (q t — 1, 

q 2 + 1; (j3, q 4 ), one of the N A y(t) + N a f,(t) lineages, which carry 
the b allele, must be chosen to recombine, and the rate is 
qir — %. — Si And the other transition rates can be obtained 
by similar rationale. Note that the selected allele first entered into the 
population at time T, so the states of the embedded Markov chain 
should satisfy Q(T) e {(1, 0; q 3 , q 4 ), (0, 1; q 3 , q 4 )} and Q(f) = (0, 0; q 3 , 
q 4 ), t > T. 

A simplified multilocus model for haplotype structure 

The model for two-locus haplotypes can be naturally extended to 
multilocus haplotypes. However, the extent of haplotype structure that 
is used to infer selection intensity and allele age usually spans a large 
region which covers several hundred kilobases, or even more than 
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For a given parameter setting, {*. I'}, 
simulate an allele frequency 
trajectory H with the backward 
Wright-Fisher model. 
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Estimate the probability for the ml li 
simulated allele frequency trajectory 
V. { '" ) under forward and backward 
Wright-Fisher process ( 
Pf-CH 1 " 0 ) and f>i 1 ( H u " ) )), and then 
estimate the importance ratio 
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Given a simulated trajectory W m \ a 
Markov process £/"' 1 starting from 
configuration in the current generation 
((F. n)u) is constructed and 
genealogical events are simulated 
backward until the Markov chain 
reaches the absorbing state ( ( Tu. ei ) T ) 
at time r. 



V 



Estimate the importance ratio as the 
product of sequential importance 
weights for L steps of the Markov 
process before its absorption: 
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Repeat K times 



Figure 2 Flowchart of impor- 
tance sampling procedures of 
the method. 




Estimate the likelihood function: 
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a mega-base. In such a large region, there may be hundreds of 
polymorphic sites. When the number of SNP loci increases, the number 
of possible states of the ancestral process increases exponentially, and 
the transition matrix becomes so large that numerical evaluation 
becomes impossible. It is thus necessary to develop a parsimonious 
model for multilocus haplotypes that is both computationally fast 
and statistically efficient. 

The novel multilocus model we present here exploits the extents of 
the "ancestral haplotypes" retained during the selection process. We 
use the term "ancestral haplotype" to refer to the alleles at each SNP 
position on the ancestral chromosome, and the term "background 
haplotypes" to refer to the other haplotypes. In this model, we con- 
sider the interplay between selection and recombination acting upon 
the ancestral haplotype. As the selected allele increases its frequency, 
recombination breaks up the ancestral haplotypes and mixes them 
with the background haplotypes, resulting in the sample we observe 
at present. We make several simplifications or assumptions to expe- 
dite the computation in the sections to follow. 



J 

The ancestral state of each position along the haplotypes is 
assumed known: For each position of a chromosome, it is assumed to 
be known whether the allele at that position is descended from an 
ancestral haplotype or one of the background haplotypes. In 
reality, the ancestral haplotype information cannot be observed 
directly from the data. The ancestral states and the break points of 
the ancestral haplotypes have to be inferred for each chromosome 
from patterns of SNP variation by other means (for example, the 
hidden Markov model for detecting recent positive selection, see 
Chen (2007)). 

The haplotype structure of background haplotypes is ignored: It 

is reasonable to believe that the primary information for the 
inference of allele age and selection intensity comes from the 
extent of the ancestral haplotypes retained during selection. For 
example, states of the jump process for a two-locus haplotype 
model are reduced to Q(r) = (q u q 2 ), and the absorbing states are 
now (1,0) and (0,1). 



1432 I H. Chen and M. Slatkin 



"~5tG3'Genes I Genomes | Genetics 




12 3 4 



Figure 3 A realization of the genealogies for a sample of four 
haplotypes (lineages 1-4) to illustrate possible events in the geneal- 
ogies. Black denotes the ancestral haplotype region (see the main text 
for the definition of "ancestral haplotypes"), and white denotes back- 
ground haplotypes. A star denotes a neutral mutant arising on the an- 
cestral haplotype. The present time, t= 0, is on the bottom. When going 
back in time, the events are coalescent (lineages 2 and 3 coalesce to the 
ancestral lineage 5), recombination (lineage 6 — > lineage 4), coalescent 
(lineages 5 and 6 coalesce to the ancestral lineage 7), recombination 
(lineage 8 — > lineage 7), mutation (on lineage 9), coalescent (lineages 8 
and 9 coalesce to lineage 10) in sequence. 

The population frequencies of ancestral haplotypes at time t are 
approximated using the expectations: To obtain the transition rates 
in Table 2 for a two-locus model, the population counts of haplotypes 
AB, Ab, aB, and ab at time t, N AB (t), N Ab (t), N^t), and N ah (t), are 
needed. For a multilocus haplotype model, these values correspond to 
the population frequencies of haplotypes at time t. In studies of fine- 
scale disease mapping, these allele frequencies were assumed to be 
constant over time and identical to the observed frequencies in the 
current population (Rannala and Reeve 2001). When there is selec- 
tion, the allele frequencies of nonrecombined haplotypes change over 
time, and their expectations have to be derived using a deterministic 
model. These expectations will be used as an estimate of the true 
haplotype frequencies at any time t. The details of the equations for 
the haplotype frequencies at time t can be seen in Appendix A. 

Multiple "migrations" of ancestral haplotype fragments between 
selected and neutral haplotypes are ignored: In the ancestral process 
of a two-locus haplotype model, there is a probability that a lineage of the 
neutral marker experiences two recombination events during the sweep 
process. In other words, the ancestral haplotype crosses over twice with 
a background haplotype at the marker position and the segment of the 



ancestral haplotype "migrates" to and then back from the group of 
neutral haplotypes. The probability for such events are small during 
a selective sweep, and thus are ignored (on order O (j^jjij > where 
a = 2N s, see Etheridge et al. 2006). 

Because of the assumptions (1) — (4), the state space of the ances- 
tral process can be reduced by considering the unique pattern of 
ancestral haplotype lengths in combination with the occurrence of 
mutation since selection began. Because we have assumed that the 
ancestral states of SNPs along the chromosomes are known, for every 
haplotype, we can determine the break points of the ancestral haplo- 
type caused by the recombination events nearest to the mutant, in 
addition to the locations where mutations occurred within each an- 
cestral haplotype region. With this information known, the ancestral 
haplotype on each side of the mutant can be coded as follows: for 
every selected haplotype, we record the SNPs to the left and to the 
right sides that delimit the ancestral haplotype; if there are mutations 
within the ancestral haplotype regions, the positions of the mutants 
are also recorded and listed as "mutation coordinates" behind the two 
"recombination coordinates." In Table 3, we give an example of 10 
selected haplotypes consisting of 25 SNPs, among which the ancestral 
haplotype regions are highlighted. The selected mutant is located at 
position 18 (shown in boldface type). The left end of the ancestral 
haplotype for the first haplotype is 7 to the left of the mutant, and the 
right end is 6 to the right, such that the first haplotype is recorded as 
(7, 6). For haplotype 3, the full code is (12, 7, 21) with a mutation 
occurring in position 21. By this coding rule, the configuration of the 
sample listed in Column 3 of Table 3 is summarized in Column 3. 

For a recoded haplotype type h = (R lt R 2 , Mi, . . ., M^), the first two 
entries, corresponding to the left and right break points of the ances- 
tral haplotypes, are the "recombination coordinates" and the other 
entries are the "mutation coordinates." In this model, the transition 
among different haplotypes is caused by recombination and mutation. 
For the coded haplotypes consisting of only recombination coordi- 
nates, the transition among different haplotype types can occur only 
through recombination. If there are m L loci to the left of the mutant 
and m R loci to the right of the mutant, the total number of possible 
allele types is (m L + 1) x (m R + 1). The number of possible states is 
greatly reduced compared to a direct extension of the two-locus 
model, whose state space grows exponentially with number of SNPs. 

In Table 4, we present a partial list of transition rates caused by 
recombination for a 4-locus haplotype model. Assume that the hap- 
lotype has 4 SNP loci, with the alleles on the ancestral haplotype being 
A, B, C, and D. A is the selected mutant, and the order of the four loci 
along the chromosome is the same as their alphabetical order. We use 
the notation [ABCD] to denote the intact segment of ancestral hap- 
lotype. And similarly [AB — d] indicates that the first two loci have 
the inherited ancestral haplotype of A and B, the allele of the third 
locus is arbitrary, and the fourth locus is a background haplotype. 
Examples in Table 4 show some of the one-step transition rates start- 
ing from state [ABcd]. For example, for haplotype [ABcd] to jump 
to [ABCD], one of the [ABafl haplotypes should be chosen and re- 
combination has to occur between allele B and c, in such a way 
that the chosen haplotype crosses over with haplotypes [ABCD], 
[AbCD] and [a — CD]. The one-step transition probability is: 

r BC {P[ABCD](t) X t + P[AbCD](t) -X t +P [a (t) ■ (l - X t )}, with X t 

being the population allele frequency of allele A at time t, and P[.](f) 
being the frequency of the haplotype in square parenthesis among 
either the selected haplotype or the background haplotype group, 
depending on the allele type carried by the particular haplotype at 
the selected mutant locus. Because of assumption (4) we made pre- 
viously (also made by Durrett and Schweinsberg (2004)), the second 
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Table 1 Definitions of notations used in this article 



Notation Meaning 



^sample 


Tota niimhpr of han ntvnpQ in thp wmn p 

1 ULul MUI 1 IUCI Ul lldkJI^ylykJCO III LI 1 C JO 1 1 1 LJ 1 


n 


Number of selected haplotypes 


m 


Number of SNPs of a sample 


Wi and wig 


Number of SNPs on the left and right sides of the mutant 


Dm = 0 or 1 


The /th SNP of the /th haplotype 


N t 


Population size at time t 


T 


Allele age, or the time when the mutant arose in the population 


s 


The selection coefficient 




Recombination fraction of the haplotype 


LL 


Mutation rate of the haplotype 


6 = 4N{x 


The scaled mutation rate of the haplotype 


p = 4Nr 


Thp era po rppomhination ratp of thp nan nt\/np 

IIItT OLulCU ld~wlllkJIIIC1Llwll 1 uLC Ul LI 1 G 1 luUIULypL 


ft 


The proportion of ancestral haplotype region as a fraction of the length of 




the /th haplotype 


it, V 1 1 1 1 — \ i • • • i ' 1 1 'Ul 


The allele frequency trajectory 


/( 


The number of the selected allele in the whole population at time t 


X t = //(2NJ 


The frequency of the selected allele at time t 


h, = (R,, R 2 , Ml 


A recoded haplotype which includes two recombination 




Coordinates and k mutation coordinates 


T = {h, 


The d different haplotypes of a sample 


n = {m, . . ., 


The number of haplotypes for each haplotype group in T 


(T, n), 


The sample configuration at time t 


q((T, n) t ) 


Sampling probability of the sample configuration (T, n) at time t 


ej = (0, 0,. . .,1,. . .,0) 


The /th unit vector 


y(v n) = ( n Va„X„T 1 + n0/2 + no/2 


The total rate for events at time v 




The ratio of population size at tto that at the present 


Of 1^ 


Hpnntpe a Qnift onprator ann .*>r)i Hpnntpe Hp ptinn thp firet mutation 

<Jf UCI lULCo O J 1 1 1 1 L UUCI u LUI ^ CI 1 1 KJ ' ' ' K LCj UCICLM IU LI 1 C 1 1 1 3 L 1 1 IU Lu LIUI 1 




coordinate of the /cth haplotype 


Chi 


C denotes a coordinate change operator, and Ch k denotes changing one 




of the two recombination coordinates of the !<th haplotype and elimi- 




nating all mutation coordinates outside the ancestral regions delimited 




by the new recombination coordinates 




The deleting operator that deletes haplotype from T 


£(s) 


Likelihood function of the data 




The mth genealogical history, which consists of multiple steps of events, 




including recombination, mutation and coalescences 


Q(W) 


Proposal distribution for H in the importance sampling algorithm 


Q(0IW) 


Proposal distribution for Q conditional on a H in the importance sampling 




algorithm 



and the third terms are small and can be ignored. With this simpli- 
fication, the transition rate in Table 4 becomes r BC ■ f [abcd](0 • X* 
Similarly, we can simplify the other transition rates shown in Table 4. 

A mutation coordinate records the SNP position at which the 
haplotype has an allele mutated from the ancestral haplotype. We 
assume an infinitely-many-sites model for mutations on the haplo- 
types. According to Griffiths and Tavare (1995), the set of nonrecom- 
bining haplotypes carrying these mutations is identical to a rooted 



gene tree, and the sequence of mutations corresponds to the path from 
the haplotype to the common ancestor [the root of the gene tree, see 
Griffiths and Tavare (1995) for a detailed discussion]. We use the 
same notation scheme for mutations as Griffiths and Tavare (1995). 
Because the haplotypes we investigate are from recombining regions, 
an additional constraint is added to reflect the effect of recombi- 
nations: the sequence of mutations we recorded as mutation coordi- 
nates includes only those located between the two recombination 



Table 2 Possible transitions from (q-i, q 2 ; 93. 94) and the rates for the two-locus haplotype model 



Transition 



Rate 



(qi, q2, 93. qJ-Kqi - 1. qi, q3, q 4 ) 
(qi, q-z, q3, q4)-*(qi - 1, qi + 1; q3, qd 
(qi, q2, q3, q4)-Kqi -1, qz) q3 + 1, qJ 
(qi, qz, q3, q-J-Kqi -1. q* q3, q4 + 1) 



qi(i-r) 



(qi -1) 



qi 



N AB (t) 

N Ab (t) + N afa (t) - q 2 - q 4 



2N t 

N A B(t) + N aB (t) - q 3 



2N t 

N A b(t) + N ab (t) - q 2 - q 4 
2N t 
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Table 3 An example of haplotype configuration to demonstrate 
the coding rules used to denote the haplotype structure 



Haplotype Type 



Count Number 



Code 



1 1 1 1 1 1 001 1 000000000000001 
1111111111111000000000000 
11111 0000000000000001 0000 
000000000000001 0000000000 



(7,6) 
(4,7) 

(12,7,21) 
(17,7,15) 



There are 10 haplotypes with 25 single-nucleotide polymorphisms in four 
distinct groups in the sample. The mutant is located in position 1 8 and shown in 
boldface type. The ancestral region for each haplotype is highlighted. The 
codes for the four haplotype groups are listed in the third column 



coordinates, that is, the subset of mutations on the retained ancestral 
haplotype region. For haplotype data from nonrecombining regions, 
the recombination coordinates are identical for all haplotypes (the left 
and right ends of the whole haplotype), and the mutation coordinates 
define a gene tree with the rooted genealogy, meaning that the state of 
the common ancestor of the sample is known (Griffiths and Tavare, 
1994b). This is the type of data analyzed by the approach of Coop and 
Griffiths (2004). Thus their method can be viewed as a special case of 
our method with no recombination. 

Sampling probability of a multilocus 
haplotype configuration 

In the section A simplified multilocus model for haplotype structure, we 
described a novel simplified multilocus model that can dramatically 
reduce the state space of the haplotype ancestral process, and illus- 
trated how to obtain the transition probabilities between different 
states. We now consider the computation of the probability of a sample 
of multilocus haplotypes. 

A sample of selected haplotypes can be coded and summarized by 
the rules introduced in the section A simplified multi-locus model for 
haplotype structure and grouped into d distinct groups T = {h u . . ., 
hj] with the corresponding multiplicities n = {«!, . . ., n^}. We define 
the sampling probability, q((T, n) f ), to be the probability of observing 
the sample configuration (T, n) at t generations before the current 
generation. The entire history of the sample configuration {(T, n)„ t > 
0} can be described by a Markov process that starts at time t = 0 and 
continues until reaching the absorbing state (T, at a random 
time t, where denotes the unit vector = (0,0, . . ., 1, . . ., 0) with 
only the jth entry being 1. When t = 0, q((T, n) 0 ) is the partial 
likelihood of the data which is sufficient for the inference of se- 
lection intensity and allele age. The sampling probability at time t 
can be obtained by recursively summing over all possible state 
paths in the backward Markov process. The recursive formula 
can be written as 

f- CO 

q((T,n) t ) = / £ p((T,n) t \(T',n') v )q((T',n%)g(v\n,t)dv, 

Jt IT' n '\ 



(2) 



where p({T, n) t \(T' , n') v ) is the transition probability of jumping 
from state (T', n') at time v to state (T, n) at time t, and g(v\n, t) is 
the density function of the inter-arrival time to the next event given 
an event at time t. 

As the Markovian ancestral process is restricted to the selected 
haplotypes, the process behaves as if in a population with temporally 
varying size {/<, t = 0.1, . . ., Tj. If time is measured in a scale of 2N 0 
generations, the coalescent rate is (")(A t X f ) , with X, = I t /2N t and 
\ t = N,/N 0 being the population size ratio. With the same scaling, 



the mutation rate for the ith haplotype is = 4N 0 /li/3, = d /8„ and 
the recombination rate is p, = 4N 0 cf3j = p/3,. 6 = 4N 0 /u. and p = 4N 0 c 
are the scaled mutation rate and recombination rate for the whole 
haplotype, and /3, denotes the proportion of the retained ancestral 
haplotype region, or the inter-region between two recombination 
coordinates, out of the entire length of the ith haplotype. Note 
that pi changes over time with the change of recombination 
coordinates of the ith haplotype. The inter-arrival time to the 
next event, v, given that the last event happened at time t has 
a non-homogeneous exponential distribution, with the density 
function in the form of 



g(v\n,t) = y(v, n)exp 



y(u, n)du 



(3) 



t < v < oo, where at time v, y(v, n) = ( " ) (\ t X v )~ l + 
YTi=i®i/2 + YTi=iPil^ i s the rate for the any events. As the allele 
frequency trajectory {X„ 0 < t < T} is a discrete-time random 
process following the Wright-Fisher model, we adopt the geometric 
distribution for discrete time instead of using the continuous ap- 
proximation in Equation 3: 



g(v\n,t) = y(v,n)x (l - y(u, nj). 

u=t+l 



(4) 



Conditional on an event happening at time v, the probabilities 
for the event being a mutation, recombination or coalescent are 
respectively 



y(v,n) 



Elm/* 

y(v, n) 



and 



1 

A V X V 
y(v,n) 



(5) 



If a mutation occurs, one of the lineages in the sample is chosen to 
mutate into other types according to the mutation model, and the 
mutation coordinate of that haplotype is modified correspond- 
ingly; if a coalescence event occurs within the jth haplotype group, 
two of the existing lineages with haplotype hj are chosen at ran- 
dom to coalesce, and the number of lineages in the jth haplotype 
group, rij, is decreased by 1; otherwise, a position along the hap- 
lotype is chosen for the recombination event to occur with the 
consequence that one of the recombination coordinates is changed 
to record the recombination at that position (see Figure 3 for 
a realization of the genealogical history for a sample of four 
haplotypes). 

We now present the detailed recursion equation (Equation 6), 
expressed as a sum over the above three types of events in a way 
corresponding to Equation 2 for our model. Under the infinitely- 
many-sites mutation model (Watterson 1975) and the proposed mul- 
tilocus haplotype model for the extent of ancestral haplotypes, sum- 
ming over possible one-step configuration changes at time v leads to 
the following equation: 



Table 4 The transition probabilities for some states of the 
multilocus haplotype model 



Transition 




Rate 




[ABcd]^[ABCD] 
[ABcd] — > [ABCd] 
[ABccH^lAbcd] 


r B c{P[ABCD](t) ■ X, + 
rec{P[ABCd](t) ■ X, + 

Fab{P [Abed] 


P[AbCDj(t)-X t + P la -CD](t)-(1 
P[AfaCd](t)-X,+ P Ia -cdl(t)-(1 - 

(t)'X t + P [aBcd| (t).(1-X t )} 


-x,)} 

-x,m 
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n(n k - 1) 



k:n k >2 MY," 



+ E 

fc:Mfc=l,hft distinct. 
.'/h^hj for all 7 



X-yXy 



<j((T,n-e k ) v ) 



2y(v, n 



6{n, + 1)13, 

+ E E \ J ( i 1 

k:n t =l,h distinct j:. c /'h k =hj z 7\ v > n ) 

Xq((^T,^ jfc (n + e j )) v ) 



E E 

i j: Whj=hj,nj = 



2y(v,n) 



l )ph),h 



Xq((fh i T,ti-e i 



The notation in Equation 6 has the following meaning: e k is the Mi 
unit vector, representing a multiplicity of the fcth distinct haplo- 
type; pi is the length of ancestral haplotype region divided by 
length of the Ith haplotype. We follow the notation of Griffiths 
and Tavare (1995) and use several operators to denote the changes 
of sample configuration: 5 is the shift operator that can be oper- 
ated on a specific haplotype h k or the entire haplotype set T. 
Specifically, Sh k represents the haplotype obtained by deleting 
the first mutation coordinate of haplotype h k . Similarly, S^T rep- 
resents the new set of distinct haplotypes obtained by deleting the 
first mutation coordinate of the kth distinct haplotype, h k , in T, 
Another operator that operates on the entire haplotype is IZu which 
removes the fcth distinct haplotype h k from the set T. The third 
coordinate change operator C is defined in this manuscript to 
denote the coordinate changes caused by recombinations: Ch t 
changes one of the two recombination coordinates of haplotype 
i, Rh : = {Rhj,i,Rh,,2}-> an d eliminates all mutation coordinates out- 
side the regions delimited by Rf, i and Rh h 2- 

Next we explain how we derive the recursive formula in Equation 
6. Starting from time v back from the present, there are four possible 
paths to arrive at the sample configuration (T, n) at time v: (1) a co- 
alescent event occurred at time v, and a possible sample configuration 
prior to time v was (T, n — e^); (2) a mutation occurred on a haplo- 
type hi that had only single multiplicity, or n k = 1, at time v, and a new 
mutation coordinate was added to h k \ (3) a mutation occurred on 
a haplotype hj that had multiplicity greater than 1, or Hi > 1, at time 
v, and a new haplotype h k with n k = 1 was generated by adding the 
new mutation coordinate to hp (4) a recombination event occurred, 
and altered one of the recombination coordinates of hi by that of hi, 
Note that a recombination event not only changes the recombination 
coordinate, it also changes the mutation coordinates: after the recom- 
bination coordinates are changed by a recombination event, all the 
mutation coordinates of that haplotype are checked, and only those 
located within the interregion between the two new recombination 
coordinates are kept. The four terms on the RHS of Equation 6 
correspond to the above four paths respectively, and the derivation 
of the first three terms follows Griffiths and Tavare (1994a, 1995). 
In the first path, the sample configuration at time v compatible 
with the occurrence of coalescence is (T, n — n k ), the probability 



And whenstarting from the configuration (T, n — n k ) and going 
forward in time, the probability that one of the n k — 1 haplotype 
is chosen to duplicate is "* J 1 (Griffiths and Tavare, 1994a). The 
one-step transition probability of p((T, n) v |(T, n — e k ) v ) is then 

, (") 

1A x y ^ v n y Note that a restriction for haplotype group h k is that 
there must be more than one lineage in group h k at time v. Sum- 
ming over all possible haplotype groups that have multiplicity n k > 
(6) 2 at time v, and are compatible for coalescent events to occur, we 
obtain the first term of Equation 6. The second and the third path 
correspond to the cases when the event occurring at time v is 
a mutation. Under the assumption of the infinitely many-sites 
model, if a mutation event occurs, it can result only in one of 
the single-multiplicity haplotype groups at time v (n k = 1) and 
the mutation coordinate must be a singleton in the sample config- 
uration at time v. In both the second and the third paths, the 

chance that a mutation occurred at time v is Sj^r - The configu- 
ration at time v compatible with the occurrence of second path is 
(S^T, n), and the probability for the mutation to happen to hap- 
lotype h k is ^=^* — . Summing over all haplotypes satisfying n k .= I 

and Sh ^ hj for all j yields the second term in Equation 6. In the 
third path, the probability for the mutation to happen to haplotype 

hk is Ej\?7i t =ft. \i 1 with the sample configuration prior to the 

event being (J% k T, £%k(n + Cj)) for all k with n k = 1 at time v. In 
the fourth path, recombination occurs with a probability of 

^y(vn) ■ ^ recombination causes a haplotype hj to become h h 
the haplotype configuration prior to the event is (C?!,-T, (n — 
e, + e0) and the probability for a haplotype hi changing into h t is 

&=S-^- -p hjjv w ith n j — 0- The transition probability p^. ^ between 

different haplotypes follows the multilocus haplotype model pre- 
sented in the section A simplified multi-locus model for haplotype 
structure, where h t and hj correspond to one of the distinct hap- 
lotypes defined by the "recombination coordinates". Combining 
these possibilities and averaging over the time to the first event 
more ancient than t, the sampling distribution of haplotype con- 
figuration, (T, n), is analogous to Equation 2: 

q((T,n) t ) = J" <,((T,n) v )y((T,n) v )exp(- J\((T,n) u )du\dv 

(7) 



that the event occurred at time v is a coalescent event is 



Although we have reduced the state numbers defined by 
recombinations from 2( mi+mB+1 ' to (m L + 1) x (m R + 1) for the 
simplified multilocus haplotype model, it is still difficult to numer- 
ically solve the distribution function induced by the Markov process. 
Therefore ,we still need to use the importance sampling algorithms 
to estimate the sampling probability, as will be shown in following 
sections. 

IMPORTANCE SAMPLING AND PROPOSAL 
DISTRIBUTIONS 

Likelihood and importance sampling 

Importance sampling algorithms are used to efficiently sample from 
^ 2 ) the probability spaces of frequency trajectories and intra-allelic 

A v x v y(v,H)' genealogies in order to approximate the integral in the likelihood 
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function (see Equation 1). The likelihood function in Equation 1 can 
be expressed as 



= / 1 mo 



p(g\h) 
Q(G\H) 



Q(G\n) 



¥(H\s, r) 
Q{H\s,T) 



Q{n\s,r)dgdn, 



(8) 



where Q(Q\H) and Q0W|s, T) are the proposal distributions that 
have non-zero weight only on genealogies and trajectories compat- 
ible with the data V (that is, ¥(V\ Q) = 1). Suppose that M random 
frequency trajectories and L genealogical histories for each of the 
trajectories are sampled, then the approximation to Equation 8 
becomes 



C(s) 



1 1 

MK 



EE 





sj) 













(9) 



where TiS m) and C/ ((c) are the with and fcth independent samples 
from the proposal distributions. The ratio is called the importance 
weight. The importance sampling algorithm for the genealogies will be 
presented in the section A proposal distribution for sampling genea- 
logical histories conditional on a trajectory and the algorithm for the 
allele frequency trajectories in the section The proposal distribution for 
sampling allele frequency trajectories of the selected allele in a popula- 
tion of time-varying size. We illustrate the proposal distributions and 
the calculation of importance weights in those two sections. 

Allele age, T, is not explicitly expressed as a variable in the likeli- 
hood function. It is the end point of the frequency trajectory, and thus 
depends on s through P(7^|s). Once the maximum likelihood estimate 
s is found, the posterior distribution of T can be obtained from the 
repeated samples of Ti. given s = s. This method for estimating allele 
age has been used by Coop and Griffiths (2004), Saunders et al. (2005) 
and Wood et al. (2005), while it is different from the Bayesian ap- 
proach of Slatkin (2008), who assumed a prior for allele age and 
jointly inferred both selection intensity and allele age. 

A proposal distribution for sampling genealogical 
histories conditional on a trajectory 

The recursion of the genealogical histories given in Equation 6 for the 
likelihood of the data cannot be computed exactly for large data sets 
since there are too many compatible sets of ancestral states. We adopt 
an importance sampling algorithm to approximate the likelihood by 
Monte Carlo methods. There are many ways of constructing the 
proposal distributions for the importance sampling algorithm (Grif- 
fiths and Tavare, 1994b; Stephens et al. 2001; Paul et al. 2011). Here 
we follow the scheme developed by Griffiths and Tavare (1994b). As 
described in previous sections, the infinitely-many-sites model for 
mutations in conjunction with the simplified multi-locus haplotype 
model is assumed. 

In the algorithm, a Markov process starting from the configuration 
in the current generation (T, n) 0 , conditional on a randomly sampled 
historical frequency trajectory {X„ t = 0, . . ., T}, is constructed and 
simulated backward in time until reaching the absorbing state (T, 
at time t. The algorithm is summarized as follows: 

1. Generate time to the next event, v, by the density function given 
in Equation 4; 

2. Choose one of the three possible events (recombination, mutation 
or coalescence) from the proposal distribution. We first define the 
total rate that any event occurs at time v as 



' A„X„ 



- Y. E PA n i + O/v* + en 

• j: Wht=h) 



(10) 



where 

E ft+ E E in+ l )-Pj- 

k,tlk=l,hk distinctj.V'TifcT^fy for all j k:n k = i,hk distinct j:^hk=hj 

(ii) 



The proposal distribution is designed in such a way that a possible 
event at time v is chosen with probability in proportion to the size of 
each term in h({T, n) v ): 



n(n k - 1) 



k v X v 



p((T>,n%\(T,n) v ) = \ 

*((?»„) ' 
*((?»„) ' 



(T',n') = (T, n — et) and > 2, 
(T',n') = (.^T,n), 

(T',n') = (a k T,& k n + e)), 

(T',n') = (rfcfT, a -e» + «j). 

(12) 



3. Update the configuration to reflect the chosen event. Let Qj, j & 
0 denote the jth event during the genealogical history, and Q 0 = 
(T, n) 0 . Q(Gj \Gj-i) is the transition probability of the backward 
Markov process determined by the proposal distribution (Equa- 
tion 12). Similarly, P(5 ; _i \Qj) is the transition probability of the 
forward Markov process. The sequential importance weight for 

, , , HQi~i\Gi) , 

the jth step change — ^— is estimated. Here we illustrate the 

Q.(Sj\Qj-i) 

calculation of the importance weight for the case in which the 
chosen event is = (T, n — e^). As shown in the section 
Sampling probability of a multilocus haplotype configuration, 

(")(»*-!) 

y(v,n)(»- 1)' aru ^ fr° m Equation 12 we have 



'j\Gj-\ 



H(7,n) r 



Taking the ratio of the two terms, we 



obtain the importance weight for the jth step: ^^vn) Table 5 
provides more details of importance weights for other events. 

4. Repeat steps 1—3 to continue generating the historical events in 
the genealogy backward in time. 

5. Stop when the absorbing state is reached, that is, a single lineage 
remains in the sample configuration, (T, ei), or if the proposed 
time for next event is beyond the end of the frequency 
trajectories; 

6. Assume that there are J steps until the Markov chain reaches the 
absorbing states, the ratio of the forward/backward paths is the 
product of sequential importance weights: 



H 



n 



n 



HGj-i\Gj) 



}{Gj\Gj- 



(13) 
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Table 5 The proposal distribution and importance weights for the importance sampling algorithm presented in the section A proposal 
distribution for sampling genealogical histories conditional on a trajectory 









Importance Weight 


(T, n - e k ) 


n(n k - 1) 




*((?>)„) 


A v X v fe((T,n) v ) 


2y(v, n) 



[S k T, n) 

(n k r, n k (n + ej)) 

(ChfT, n - ei + ej) 



0(nj + 1)/3j 

Bjpjrij + 1)p h , A 
»((T",n) y ) 



A„X v y(v, n)(n- 1) 

T(v,n) Ewft 

EM/3/0/2("j + 1 )Pj 
T(v,n) Etift 

Pjp(n + i)phj,h, 

y(v,n) 



2y(v, n) 
2y(v, n) 
2y(v, n) 



denotes four possible events of the genealogical history. P^j-il^) and Q[&j\Gj-i) are the one-step transition probability of the forward and backward Markov 
process constructed for simulating the genealogical history. The importance weight is estimated by Q(g.jg!^) 



and is used in the likelihood function. For those paths with time 
beyond the end of the given frequency trajectories, {X,}, the ratio is set 
to zero, which means the sample is rejected. 



The proposal distribution for sampling allele frequency 
trajectories of the selected allele in a population of 
time-varying size 

We use the backward Wright-Fisher model under selection to sample the 
allele frequency trajectories. The importance sampling algorithm for 
sampling frequency trajectories of the selected allele is described as follows. 
A detailed explanation can be found in the original paper (Slatkin 2001). 

1. Given a selection intensity s and parameter set T, a sample path is 
simulated from t = 0 (current generation) with I 0 copies of A, and 
then proceeds backward from generation to generation assuming 
the following binomial distribution: 



2N t —I t 



(14) 



where Y{- 1 satisfies: Y{- 



+ft(l ~ 17- 1) 



.-1, and 

Y t - 1 = I t - 1 /(2N t - 1 ). The backward process is stopped 1 at time T when 
the allele is lost. The probability of the backward process is calculated 

as: P 5 (W (m) ) = rL=iP(-r t |'t-l)- 

2. For a frequency trajectory 7i (m > simulated in Step 1, the proba- 
bility it is generated by the forward process is computed. In the 
Wright-Fisher model with selection, the number of allele A from 
generation t to generation t — 1 follows a binomial distribution: 



P(If-l|/r) 



2N t -i 
h-i 



with 



x t =x t 



1 +siX t + s 2 (l -X t ) 
1 + Sl Xf + 2s 2 X t {l - X t ) 



(15) 



(16) 



which is the allele frequency of A after selection in generation t. In 
Equation 16, X, = I t l(2N t ) is the frequency of allele A before selection 
in generation t. The probability of the sample path 7i (m) is 



7 F (n im) ) =f[p(i t - 1 \i t ) 



(17) 



where P(7 r _ 1 |7 r ) = 1 if I T -! = 1 and 0 otherwise. And the subscript _F 
indicates that the process is "forward" in time. 

3. The importance weight is calculated as (Slatkin 2001): 



■N. 



(18) 



where N is the population size at the first generation after the allele is 
lost in the backward process. The multiplication of N is needed in 
Equation 18, since the rate of influx of new mutations is proportional 
to the population size of that generation. 

APPLICATIONS 
Simulation 

Using the coalescent simulator SelSim (Spencer and Coop 2004), data 
are generated for two sets of parameters corresponding to medium 
and strong selection respectively: 6 = 4N/x = 500, p = 4Nr = 500, Ns = 
50, and 6 = 4JV/x = 500, p = 4Nr = 500, Ns = 500, where 6, p, and Ns 
represent the mutation rate, recombination rate and selection coeffi- 
cient scaled by the effective population size. The frequencies of the 
selected alleles at the present are chosen to be 0.60. Since the Moran 
model is used in SelSim, whereas the Wright-Fisher model is used in 
our method, the effective population size in the simulations is scaled 
to match that of a Wright-Fisher model by multiplying by a factor of 2 
(Watterson 1975). We estimate the log-likelihood of s for a range of 
selection coefficients with the other parameters in T known, assuming 
that the population has a constant size of 10,000. The curves of the 
log-likelihood over the grid of s values are smoothed by a local poly- 
nomial smoother. This smoother fits a linear function to a subset of 
data points within a local window of the target point where the log- 
likelihood is to be estimated. The fitting is carried out by the weighted 
least square regression, which gives more weight to points close to the 
target point and less weight to distant points. The log-likelihood is 
thus estimated as the fitted value at the target point. The size of the 
local window or the bandwidth is chosen by eye for each curve. The 
log-likelihood curves are plotted in Figure 4 and Figure 5. 

To evaluate the performance of the importance sampling approx- 
imation, we perform eight independent simulations for every 
parameter combination. One million iterations in the importance 
sampling algorithm are required to ensure good estimates, and the 
likelihood curves are presented together in Figure 4 and Figure 5. For 
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Selection coefficient 

Figure 4 The relative likelihood curve for the simulated data with the 
selection coefficient s = 0.05 and a constant population size N = 
10,000. The comparison of eight estimates of the likelihood curves is 
presented. Each estimate is an independent run of our method on 
different simulated data. The results are from 1 million iterations. 



the data set simulated with s = 0.005, the MLE ranges from 0.0041 to 
0.0073. For the data set with s = 0.05, the MLE ranges from 0.032 to 
0.0543. 

Glucose-6-phosphate dehydrogenase (G6PD) 

The G6PD gene is located on the X-chromosome. Some alleles are 
known to confer the resistance to malaria (Ruwende et al. 1995). Case- 
control studies have demonstrated that a common variant, G6PD- 
202A, reduces the risk of malaria by approximately 50% (Ruwende 
et al. 1995). This allele is at low frequency in most populations but has 
an intermediate frequency in sub-Saharan Africa. Several population 
genetic studies have investigated the effect of a recent selective sweep 
in this region (Tishkoff et al. 2001; Sabeti et al. 2002). Here we use the 
data in Sabeti et al. (2002), which consists of 252 males from there 
African populations in a 440-Kb region covering the G6PD gene. We 
analyze only the 60 haplotypes from the Beni population. There are 10 
haplotypes containing the 202 -A allele in the sample. We assume that 
the frequency of the selected allele in the Beni population is the same 
as that estimated from the sample which is 0.1667. The recombination 
fractions among SNPs are obtained by interpolation with the Oxford 
fine-scale recombination map (Myers et al. 2005). The recombination 
rate in the G6PD gene region is heterogeneous with two recombina- 
tion hot-spots, and the overall averaged recombination rate for the 
region is 1.4410 cM/Mb. We determine the end points of ancestral 
haplotypes and mutations by running the hidden Markov model 
(Chen 2007). The data configuration is coded by the rules presented 
in the section A simplified multilocus model for haplotype structure as 
shown in Table 6. 

Because the hidden Markov model analysis indicates there are no 
mutations in the ancestral haplotype regions, we set 6 to 0.0. We 
assume an effective population size of N = 10,000, which is constant 



Selection coefficient 

Figure 5 The relative likelihood curve for the simulated data with the 
selection coefficient s = 0.005 and a constant population size N = 
10,000. The comparison of eight estimates of the likelihood curves is 
presented. Each estimate is an independent run of our method on 
different simulated data. The results are from 1 million iterations. 

over time. Because G6PD is X-linked, N e is 3/4 of the autosomal size. 
We assume an additive model for selection, which means the fitness's 
of the three genotypes aa, Aa, and A A are 1, 1 + l/2s, and 1 + s, 
respectively. The likelihood of the selection coefficient is estimated by 
our method from 1 million iterations of the importance sampling 
algorithm. The log-likelihood curve is plotted in Figure 6. The selec- 
tion coefficient is estimated to be 0.0456 (95% confidence interval of 
0.0144—0.0769). From the estimated selection coefficient, the age of 
the 202 -A allele can be estimated. As shown in Figure 7, given the 
selection coefficients estimated, the corresponding posterior distribu- 
tion of allele age is plotted. 

DISCUSSION 

We have developed a likelihood method for estimating selection 
intensity and allele age from haplotype structure of multilocus SNPs 
closely linked to a selected mutant. The likelihood is based on the 
proposed simplified multilocus haplotype model, which describes 
the ancestral process of haplotype extent under the joint effects of 
selection, recombination and mutation. In this model, the state space 
of the ancestral process is determined by the extent of intact ancestral 
haplotypes in the vicinity of the selected mutant and the new 



Table 6 The sample configuration of the G6PD data according 
to coding rules in the section A simplified multilocus model for 
haplotype structure 



Haplotype Type 


Count Number 


(11, 7) 


5 


(4, 7) 


1 


(12, 7) 


1 


(17, 7) 


3 
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Selection coefficient 



0.12 



Figure 6 Likelihood curve for the G6PD data as a function of the 
selection coefficient with a constant population size of 10,000. The 
likelihood curve is smoothed by a local polynomial smoother. The 
point estimate of the selection coefficients is 0.0456 with the 95% 
confidence interval of (0.0144, 0.0769). 

mutations arising on the ancestral haplotypes during the selective 
process. Our method adopts importance sampling algorithms to 
efficiently explore the genealogical history of the sample for 
evaluation of the sampling probability. By applying the method to 
both simulated and real data, we demonstrate that the extent of the 
haplotype structure is informative for the inference of selection intensity 
of a recent positive selection. 

Our method has two merits. First, by exploiting the extent of 
haplotype structure and focusing only on subprocesses of the ARG 
related to the retained ancestral segments on the selected haplotypes, 
we dramatically reduce the computational burden such that data sets 
from genomic regions of mega-base magnitude can be analyzed. 
Second, our method can allow for changes in population size. This is 
especially important for samples from human populations outside 
Africa, because population growth can affect the pattern of linkage 
disequilibrium and haplotype structure, and thus lead to an incorrect 
estimation of the selection intensity if the effect of demographic 
history is not explicitly modeled. In our analysis of simulated data, we 
found that the estimated selection coefficient is accurate but is 
sensitive to the recombination rates assumed. Since the variability of 
recombination rates is high over human genome (Myers et al. 2005), 
good estimates of local rates are necessary to obtain accurate estimates 
of selection coefficients. 

Another factor that may affect the estimates of selection intensity 
and allele age is the SNP marker density in the data. Because mutants 
that have experienced positive selections are typically young, new 
mutations at nearby loci accumulate at a relatively slow rate compared 
to the rate of recombination that breaks down linkage disequilibrium. 
Therefore, we expect fewer segregating sites observed in regions under 
recent positive selection. In the low-density SNP data, these segregating 
sites are likely not typed. In the two data sets of the G6PD region we 
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Figure 7 The posterior distribution of the allele age in generations 
when the selection coefficient is set to the value estimated from Figure 6. 



analyzed, Sabeti et al. (2002)'s and Verrelli et al. (2002)'s data, no new 
mutations at closely linked loci were detected. Because their G6PD data 
were not generated by resequencing, a proportion of mutations may 
not have been identified or included in the data. We expect that 
resequencing data from the target gene regions will be more infor- 
mative for identifying the occurrence of recombinations and muta- 
tions during the selective process. The method developed in this 
paper is for identifying ongoing positive selection. If the selected 
allele has been fixed in the population, mutations accumulated since 
its fixation become informative and important for inferring the fix- 
ation time, for which the allele frequency spectrum after a selective 
sweep is a better choice (Chen 2012). 

The importance sampling algorithm for the genealogies adopted in 
this paper was developed by Griffiths and Tavare (1994b). In their 
proposal distribution, at each step any possible event that could lead to 
the current configuration is considered and sampled in proportion to 
their rate of occurrence (Felsenstein et al. 1999). More efficient pro- 
posal distributions have been developed (Stephens et al. 2001; Slatkin 
2002; De Lorio and Griffiths 2004; Paul et al. 2011) and can be 
adopted to improve the computational efficiency of our method. 
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APPENDIX A 

The probability distribution for the extent of an intact 
ancestral haplotype under selection 

We show how to derive the distribution for the extent of an intact 
ancestral haplotype at time t during a selective process. We investigate 
a selective sweep that starts from a single copy of the selected allele. 
Here we denote the time when the new selected mutant arises by 0, 
and look forward in time. Let X, be frequency of the selected allele at 
time t. If we ignore the initial randomness of allele frequency trajec- 
tories, which is usually modeled using a supercritical branching pro- 
cess, X t can be well approximated using the deterministic logistic 
equation (Ohta and Kimura, 1975): 



X t = 



Xn 



1-X a )e- 



(19) 



where X 0 can be set to 1/22V (Kaplan et al. 1989; Stephan et al. 1992). 
Assume the selected locus has two alleles A and a, with A being the 
advantageous allele. We are modeling a continuous segment between 
the selected mutant and the neutral marker, which has two alleles B 
and b. We use upper case letters to denote that the position is 
descended from the ancestral haplotype, and lower case letters to 
denote the position descended from the background haplotypes. 
Note here the two "alleles" are defined according to whether they 
are descended from the ancestral haplotype or not, instead of the 
true observed nucleotide types at that locus. We use [AB] to denote 
a segment of ancestral haplotype with loci A and B being the end 
points. Let P[ AB }(t) be the relative population frequency of such 
fragments among all haplotypes carrying the selected allele A at time 
t. Furthermore, we use [A — ] to denote an ancestral haplotype with A 
being from the ancestral haplotype, while the state of the other end 
point of the fragment is not determined. 

For a random ancestral haplotype [AB] or [Ab], if it recombines 
with any [A—] haplotype during the interval [0, r], it does not change 
PlAB](t)- The only possible change is caused by "effective" recombina- 
tions, that is, recombination with an [ab], or [aB] haplotype from the 
neutral haplotype "sub-population". The expected number of effective 



recombination events for a [AB] recombining with any [a—] haplo- 
type during the interval [0, t] is 



= r f u=0 (l ~ X(u))du 

= rt - -ln(l -X 0 + e st X 0 ) 



(20) 



It is not hard to see that the number of effective events on an [AB] 
ancestral haplotype during the time interval [0, t] follows a Poisson 
distribution. P[AB](t) is then identical to the probability of no effective 
recombination between [AB] for an ancestral haplotype: 



%5](f) =e" rf+ " ln(1 " X ° +£S%) 

= e- rt (l-(l-e* t )X 0 y /s . 



(21) 



Similarly, for an ancestral haplotype [ABC], the probability of 
being intact during the interval (0, t) follows the Equation 21, except 
that the recombination fraction is replaced by r[ AC ]- 

Note that in Equation 21, when st is small, which means either the 
selective process is at an early stage or the selection is weak, the term 
(1 — (1 — e 5 ') X 0 ) rA ~ 1, and thus similar to the neutral case. 
However, if st gets larger, the term cannot be ignored. For example, 
if r = 0.001, s = 0.01, X 0 = 0.001 and t = 1000, the relative bias can be 
as large as -27%. 

Also note that in the aforementioned derivation for the distribu- 
tion of ancestral haplotypes, we ignore the randomness of the 
frequency trajectory of the selected allele at the very early stage of 
the selective process, and approximate the trajectory with a determin- 
istic equation. Ignoring the randomness of the allele frequency 
trajectory at the early stage can bias the inference of parameters 
related to the sweep process, but as pointed out in previous studies 
(Kaplan et al. 1989; Braverman et al. 1995; Barton, 1998; Durrett and 
Schweinsberg, 2004; Etheridge et al. 2006), when the selection inten- 
sity is sufficiently strong, the bias is small. For this reason, our method 
is more suitable to analyze genes under strong selections. 
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